Important Note: In case of execution issue due to “cannot open file” error, please download source files from Github folder A2_resources

Introduction

Data cleaning and normalization have been performed on a RNAseq dataset produced by a recent paper titled Skeletal muscle transcriptome in healthy aging, published in Nature Communications in 2021. The raw data was obtained from GEO with the ID GSE164471 See the short bio below for some basic information of this dataset and its source. #### Dataset Meta Info Contact Location : 251 Bayview Blvd,Baltimore,USA
Contact Institute : NIA-IRP, NIH
Contact Department : Laboratory of Genetics and Genomics
Contact Laboratory : Computational Biology & Genomics Core

There are two supplemental files in this dataset:
1). Genes/RNAs (ENSG) — GSE164471_GESTALT_Muscle_ENSG_counts_annotated.csv
2). Transcripts/isoforms (ENST) — GSE164471_GESTALT_Muscle_ENST_TPM_annotated.csv
In this study, the authors defined ENSGs as genes/RNAs and ENSTs as transcripts/isoforms. For the purpose of our analysis, we will focus on RNA data.

## Warning: package 'circlize' was built under R version 4.1.2

Part 1: Differential Gene Expression

1.1 —— Load normalized count data

  • Gene expression data is obtained from Skeletal muscle transcriptome in healthy aging[1].

  • Original raw data contains:
    57773 protein-coding and non-coding RNAs across 53 healthy individuals (22-83 years old)

  • Filtered data with weakly expressed (low counts) genes removed contains
    36084 protein-coding and a few non-coding RNAs across 53 healthy individuals (22-83 years old)

# RNAfilered <- read.table("https://github.com/bcb420-2022/Karen_Kuang/blob/main/A2_resources/RNAfiltered_mapped.txt", check.names = FALSE,sep = "\t") 
# Unsure why this does not work. Messes up txt file content and format. 

RNAfiltered <- read.table("RNAfiltered_mapped.txt", check.names = FALSE)
# RNAfiltered <- read.table("A2_resources/RNAfiltered_mapped.txt", check.names = FALSE)

# Column 2 ~ 54 represent individual study participant (the first column contains the ensemble IDs). BEWARE that the last 10 columns contain other info regarding RNAs in the Ensembl hg19 v82 (September 2015) database.
# colnames(RNAfile)

# Number of study participants we have : (53)
length(grep(colnames(RNAfiltered), pattern = "MUSCLE",ignore.case = TRUE))
## [1] 53
# Number of unique genes before cleaning: (57773)
# Number of unique genes after filtering: (36084)
length(RNAfiltered$ensembl_gene_id)
## [1] 36084
head(RNAfiltered)
##          ensembl_gene_id hgnc_symbol
## TSPAN6   ENSG00000000003      TSPAN6
## TNMD     ENSG00000000005        TNMD
## DPM1     ENSG00000000419        DPM1
## SCYL3    ENSG00000000457       SCYL3
## C1orf112 ENSG00000000460    C1orf112
## FGR      ENSG00000000938         FGR
##          MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6                                                 30
## TNMD                                                   35
## DPM1                                                   68
## SCYL3                                                 191
## C1orf112                                              209
## FGR                                                    65
##          MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6                                                 57
## TNMD                                                   14
## DPM1                                                  302
## SCYL3                                                 370
## C1orf112                                              430
## FGR                                                    25
##          MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6                                                138
## TNMD                                                    0
## DPM1                                                  249
## SCYL3                                                 587
## C1orf112                                              471
## FGR                                                    23
##          MUSCLE_AGE25_M_GROUP20_34_AS_T355_Gt046_R1_COUNT
## TSPAN6                                                 81
## TNMD                                                    2
## DPM1                                                  262
## SCYL3                                                 499
## C1orf112                                              353
## FGR                                                    25
##          MUSCLE_AGE26_F_GROUP20_34_CC_T370_Gt051_R1_COUNT
## TSPAN6                                                 98
## TNMD                                                    9
## DPM1                                                  264
## SCYL3                                                 449
## C1orf112                                              330
## FGR                                                    49
##          MUSCLE_AGE26_F_GROUP20_34_CC_T400_Gt045_R1_COUNT
## TSPAN6                                                192
## TNMD                                                   11
## DPM1                                                  308
## SCYL3                                                 401
## C1orf112                                              324
## FGR                                                    35
##          MUSCLE_AGE27_F_GROUP20_34_CC_T385_Gt055_R1_COUNT
## TSPAN6                                                105
## TNMD                                                    4
## DPM1                                                  121
## SCYL3                                                 293
## C1orf112                                              239
## FGR                                                    38
##          MUSCLE_AGE28_M_GROUP20_34_CC_T326_Gt001_R1_COUNT
## TSPAN6                                                398
## TNMD                                                   24
## DPM1                                                  642
## SCYL3                                                1251
## C1orf112                                              934
## FGR                                                     8
##          MUSCLE_AGE31_F_GROUP20_34_AS_T640_Gt090_R1_COUNT
## TSPAN6                                                166
## TNMD                                                    7
## DPM1                                                  204
## SCYL3                                                 574
## C1orf112                                              397
## FGR                                                     2
##          MUSCLE_AGE31_M_GROUP20_34_AA_T230_Gt025_R1_COUNT
## TSPAN6                                                117
## TNMD                                                    4
## DPM1                                                  249
## SCYL3                                                 329
## C1orf112                                              251
## FGR                                                    17
##          MUSCLE_AGE31_M_GROUP20_34_AA_T328_Gt005_R1_COUNT
## TSPAN6                                                122
## TNMD                                                  269
## DPM1                                                  302
## SCYL3                                                 463
## C1orf112                                              430
## FGR                                                    12
##          MUSCLE_AGE33_M_GROUP20_34_CC_T340_Gt042_R1_COUNT
## TSPAN6                                                198
## TNMD                                                    6
## DPM1                                                  160
## SCYL3                                                 500
## C1orf112                                              422
## FGR                                                    57
##          MUSCLE_AGE35_M_GROUP35_49_CC_T670_Gt104_R1_COUNT
## TSPAN6                                                 44
## TNMD                                                    6
## DPM1                                                  153
## SCYL3                                                 285
## C1orf112                                              256
## FGR                                                   111
##          MUSCLE_AGE37_F_GROUP35_49_AA_T120_Gt037_R1_COUNT
## TSPAN6                                                156
## TNMD                                                   58
## DPM1                                                  197
## SCYL3                                                 365
## C1orf112                                              305
## FGR                                                    19
##          MUSCLE_AGE37_M_GROUP35_49_AA_T460_Gt064_R1_COUNT
## TSPAN6                                                 78
## TNMD                                                    1
## DPM1                                                  212
## SCYL3                                                 320
## C1orf112                                              224
## FGR                                                    23
##          MUSCLE_AGE38_M_GROUP35_49_CC_T232_Gt016_R1_COUNT
## TSPAN6                                                113
## TNMD                                                    3
## DPM1                                                  260
## SCYL3                                                 378
## C1orf112                                              364
## FGR                                                    22
##          MUSCLE_AGE38_M_GROUP35_49_CC_T743_Gt093_R1_COUNT
## TSPAN6                                                121
## TNMD                                                   14
## DPM1                                                  251
## SCYL3                                                 685
## C1orf112                                              373
## FGR                                                    25
##          MUSCLE_AGE42_F_GROUP35_49_CC_T445_Gt059_R1_COUNT
## TSPAN6                                                 75
## TNMD                                                   12
## DPM1                                                  269
## SCYL3                                                 679
## C1orf112                                              504
## FGR                                                    35
##          MUSCLE_AGE42_M_GROUP35_49_AA_T297_Gt024_R1_COUNT
## TSPAN6                                                133
## TNMD                                                    0
## DPM1                                                  660
## SCYL3                                                 807
## C1orf112                                              711
## FGR                                                    14
##          MUSCLE_AGE45_M_GROUP35_49_AA_T292_Gt030_R1_COUNT
## TSPAN6                                                138
## TNMD                                                   10
## DPM1                                                  302
## SCYL3                                                 600
## C1orf112                                              468
## FGR                                                    16
##          MUSCLE_AGE45_M_GROUP35_49_AA_T325_Gt017_R1_COUNT
## TSPAN6                                                130
## TNMD                                                    4
## DPM1                                                  220
## SCYL3                                                 453
## C1orf112                                              341
## FGR                                                     2
##          MUSCLE_AGE47_F_GROUP35_49_CC_T490_Gt078_R1_COUNT
## TSPAN6                                                122
## TNMD                                                   12
## DPM1                                                  188
## SCYL3                                                 481
## C1orf112                                              431
## FGR                                                    25
##          MUSCLE_AGE47_M_GROUP35_49_AA_T475_Gt060_R1_COUNT
## TSPAN6                                                146
## TNMD                                                    2
## DPM1                                                  297
## SCYL3                                                 484
## C1orf112                                              337
## FGR                                                    38
##          MUSCLE_AGE51_M_GROUP50_64_CC_T289_Gt015_R1_COUNT
## TSPAN6                                                124
## TNMD                                                    5
## DPM1                                                  338
## SCYL3                                                 617
## C1orf112                                              524
## FGR                                                    13
##          MUSCLE_AGE52_F_GROUP50_64_AA_T535_Gt063_R1_COUNT
## TSPAN6                                                 69
## TNMD                                                   72
## DPM1                                                   68
## SCYL3                                                 336
## C1orf112                                              240
## FGR                                                    27
##          MUSCLE_AGE52_F_GROUP50_64_CC_T331_Gt027_R1_COUNT
## TSPAN6                                                 84
## TNMD                                                    0
## DPM1                                                  229
## SCYL3                                                 237
## C1orf112                                              207
## FGR                                                    25
##          MUSCLE_AGE52_M_GROUP50_64_AA_T505_Gt056_R1_COUNT
## TSPAN6                                                152
## TNMD                                                   39
## DPM1                                                  418
## SCYL3                                                 965
## C1orf112                                              759
## FGR                                                    78
##          MUSCLE_AGE52_M_GROUP50_64_CC_T329_Gt007_R1_COUNT
## TSPAN6                                                 58
## TNMD                                                    6
## DPM1                                                  271
## SCYL3                                                 321
## C1orf112                                              251
## FGR                                                     5
##          MUSCLE_AGE54_M_GROUP50_64_AA_T520_Gt057_R1_COUNT
## TSPAN6                                                 67
## TNMD                                                    3
## DPM1                                                  344
## SCYL3                                                 484
## C1orf112                                              428
## FGR                                                    40
##          MUSCLE_AGE57_M_GROUP50_64_CC_T266_Gt004_R1_COUNT
## TSPAN6                                                107
## TNMD                                                    9
## DPM1                                                  395
## SCYL3                                                 772
## C1orf112                                              632
## FGR                                                    11
##          MUSCLE_AGE58_F_GROUP50_64_CC_T330_Gt011_R1_COUNT
## TSPAN6                                                206
## TNMD                                                   15
## DPM1                                                  198
## SCYL3                                                 584
## C1orf112                                              489
## FGR                                                   104
##          MUSCLE_AGE60_M_GROUP50_64_CC_T300_Gt013_R1_COUNT
## TSPAN6                                                263
## TNMD                                                    7
## DPM1                                                  329
## SCYL3                                                 494
## C1orf112                                              371
## FGR                                                     9
##          MUSCLE_AGE62_F_GROUP50_64_CC_T550_Gt070_R1_COUNT
## TSPAN6                                                104
## TNMD                                                    6
## DPM1                                                  373
## SCYL3                                                 968
## C1orf112                                              651
## FGR                                                    42
##          MUSCLE_AGE63_F_GROUP50_64_CC_T676_Gt101_R1_COUNT
## TSPAN6                                                111
## TNMD                                                    5
## DPM1                                                  188
## SCYL3                                                 355
## C1orf112                                              256
## FGR                                                    62
##          MUSCLE_AGE63_M_GROUP50_64_CC_T685_Gt092_R1_COUNT
## TSPAN6                                                111
## TNMD                                                  102
## DPM1                                                  236
## SCYL3                                                 508
## C1orf112                                              447
## FGR                                                    30
##          MUSCLE_AGE67_F_GROUP65_79_CC_T610_Gt080_R1_COUNT
## TSPAN6                                                 99
## TNMD                                                    6
## DPM1                                                  254
## SCYL3                                                 548
## C1orf112                                              414
## FGR                                                    35
##          MUSCLE_AGE67_M_GROUP65_79_AA_T295_Gt039_R1_COUNT
## TSPAN6                                                150
## TNMD                                                   29
## DPM1                                                  342
## SCYL3                                                 434
## C1orf112                                              375
## FGR                                                     7
##          MUSCLE_AGE67_M_GROUP65_79_CC_T228_Gt002_R1_COUNT
## TSPAN6                                                122
## TNMD                                                   19
## DPM1                                                  375
## SCYL3                                                 686
## C1orf112                                              644
## FGR                                                    46
##          MUSCLE_AGE69_F_GROUP65_79_CC_T715_Gt095_R1_COUNT
## TSPAN6                                                108
## TNMD                                                    7
## DPM1                                                  172
## SCYL3                                                 321
## C1orf112                                              344
## FGR                                                    41
##          MUSCLE_AGE70_M_GROUP65_79_AS_T625_Gt089_R1_COUNT
## TSPAN6                                                 80
## TNMD                                                    6
## DPM1                                                  221
## SCYL3                                                 519
## C1orf112                                              444
## FGR                                                    39
##          MUSCLE_AGE70_M_GROUP65_79_CC_T595_Gt061_R1_COUNT
## TSPAN6                                                101
## TNMD                                                    6
## DPM1                                                  247
## SCYL3                                                 456
## C1orf112                                              410
## FGR                                                    43
##          MUSCLE_AGE72_F_GROUP65_79_CC_T296_Gt018_R1_COUNT
## TSPAN6                                                114
## TNMD                                                    1
## DPM1                                                  363
## SCYL3                                                 417
## C1orf112                                              522
## FGR                                                    11
##          MUSCLE_AGE72_F_GROUP65_79_CC_T580_Gt071_R1_COUNT
## TSPAN6                                                122
## TNMD                                                    7
## DPM1                                                  246
## SCYL3                                                 468
## C1orf112                                              360
## FGR                                                    29
##          MUSCLE_AGE72_M_GROUP65_79_CC_T298_Gt014_R1_COUNT
## TSPAN6                                                112
## TNMD                                                    5
## DPM1                                                  247
## SCYL3                                                 320
## C1orf112                                              295
## FGR                                                    30
##          MUSCLE_AGE72_M_GROUP65_79_CC_T730_Gt084_R1_COUNT
## TSPAN6                                                129
## TNMD                                                    2
## DPM1                                                  213
## SCYL3                                                 371
## C1orf112                                              282
## FGR                                                    37
##          MUSCLE_AGE72_M_GROUP65_79_CC_T760_Gt109_R1_COUNT
## TSPAN6                                                106
## TNMD                                                   18
## DPM1                                                  255
## SCYL3                                                 554
## C1orf112                                              399
## FGR                                                    64
##          MUSCLE_AGE73_M_GROUP65_79_CC_T565_Gt065_R1_COUNT
## TSPAN6                                                 66
## TNMD                                                    1
## DPM1                                                  179
## SCYL3                                                 516
## C1orf112                                              383
## FGR                                                    42
##          MUSCLE_AGE76_M_GROUP65_79_CC_T745_Gt098_R1_COUNT
## TSPAN6                                                 52
## TNMD                                                   25
## DPM1                                                  109
## SCYL3                                                 426
## C1orf112                                              278
## FGR                                                    51
##          MUSCLE_AGE80_F_GROUP80_PLUS_CC_T290_Gt032_R1_COUNT
## TSPAN6                                                  110
## TNMD                                                     51
## DPM1                                                     40
## SCYL3                                                   206
## C1orf112                                                219
## FGR                                                      39
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T291_Gt029_R1_COUNT
## TSPAN6                                                   93
## TNMD                                                      6
## DPM1                                                    335
## SCYL3                                                   457
## C1orf112                                                436
## FGR                                                       6
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T790_Gt110_R1_COUNT
## TSPAN6                                                   40
## TNMD                                                      8
## DPM1                                                    111
## SCYL3                                                   363
## C1orf112                                                262
## FGR                                                      58
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T805_Gt108_R1_COUNT
## TSPAN6                                                   52
## TNMD                                                     18
## DPM1                                                    117
## SCYL3                                                   314
## C1orf112                                                212
## FGR                                                      29
##          MUSCLE_AGE83_M_GROUP80_PLUS_CC_T294_Gt026_R1_COUNT HG19en82 Chromosome
## TSPAN6                                                  117                chrX
## TNMD                                                     15                chrX
## DPM1                                                    345               chr20
## SCYL3                                                   745                chr1
## C1orf112                                                636                chr1
## FGR                                                       4                chr1
##          HG19en82 Gene Start (bp) HG19en82 Gene End (bp) HG19en82 Gene Length
## TSPAN6                   99883667               99894988                 2968
## TNMD                     99839799               99854882                 1610
## DPM1                     49551404               49575092                 1207
## SCYL3                   169818772              169863408                 6876
## C1orf112                169631245              169823221                 6354
## FGR                      27938575               27961788                 3474
##          HG19en82 Strand HG19en82 % GC Content HG19en82 Cytoband
## TSPAN6                 -                 40.87             q22.1
## TNMD                   +                 40.80             q22.1
## DPM1                   -                 39.85            q13.13
## SCYL3                  -                 40.14             q24.2
## C1orf112               +                 39.22             q24.2
## FGR                    -                 52.97            p36.11
##          HG19en82 Gene Biotype HG19en82 Gene Name
## TSPAN6          protein_coding             TSPAN6
## TNMD            protein_coding               TNMD
## DPM1            protein_coding               DPM1
## SCYL3           protein_coding              SCYL3
## C1orf112        protein_coding           C1orf112
## FGR             protein_coding                FGR
##                                                                                           HG19en82 Description
## TSPAN6                                                            tetraspanin 6 [Source:HGNC Symbol;Acc:11858]
## TNMD                                                                tenomodulin [Source:HGNC Symbol;Acc:17757]
## DPM1     dolichyl-phosphate mannosyltransferase polypeptide 1, catalytic subunit [Source:HGNC Symbol;Acc:3005]
## SCYL3                                               SCY1-like 3 (S. cerevisiae) [Source:HGNC Symbol;Acc:19285]
## C1orf112                                    chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:25565]
## FGR                        feline Gardner-Rasheed sarcoma viral oncogene homolog [Source:HGNC Symbol;Acc:3697]
# Load normalized count data for filtered genes. The trimmed mean method (TMM) was used for normalization.

# Restore proper formatting from text file
norm_count <- read.table("normalized_counts.txt", check.names = FALSE)
# norm_count <- read.table("A2_resources/normalized_counts.txt", check.names = FALSE)
colnames(norm_count) <- norm_count[1, 1:ncol(norm_count)]
rownames(norm_count) <- norm_count[1:nrow(norm_count), 1]
norm_count <- norm_count[-1,-1]

head(norm_count)
##          MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6                                  0.495633004144666
## TNMD                                    0.578238504835443
## DPM1                                     1.12343480939458
## SCYL3                                    3.15553012638771
## C1orf112                                  3.4529099288745
## FGR                                      1.07387150898011
##          MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6                                   1.08734841704813
## TNMD                                    0.267068032257436
## DPM1                                     5.76103898155326
## SCYL3                                    7.05822656680366
## C1orf112                                 8.20280384790695
## FGR                                     0.476907200459707
##          MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6                                   2.52854656354576
## TNMD                                                    0
## DPM1                                     4.56237749509345
## SCYL3                                    10.7554842956621
## C1orf112                                  8.6300393581888
## FGR                                     0.421424427257627
##          MUSCLE_AGE25_M_GROUP20_34_AS_T355_Gt046_R1_COUNT
## TSPAN6                                   1.66130477843066
## TNMD                                    0.041019871072362
## DPM1                                     5.37360311047942
## SCYL3                                    10.2344578325543
## C1orf112                                 7.24000724427189
## FGR                                     0.512748388404525
##          MUSCLE_AGE26_F_GROUP20_34_CC_T370_Gt051_R1_COUNT
## TSPAN6                                   1.48630884670204
## TNMD                                    0.136497751227738
## DPM1                                     4.00393403601365
## SCYL3                                    6.80972114458382
## C1orf112                                 5.00491754501706
## FGR                                     0.743154423351019
##          MUSCLE_AGE26_F_GROUP20_34_CC_T400_Gt045_R1_COUNT
## TSPAN6                                   3.39248571086676
## TNMD                                    0.194361160518408
## DPM1                                     5.44211249451543
## SCYL3                                    7.08534776071652
## C1orf112                                 5.72481963708766
## FGR                                     0.618421874376753
##          MUSCLE_AGE27_F_GROUP20_34_CC_T385_Gt055_R1_COUNT
## TSPAN6                                   2.43624428127776
## TNMD                                   0.0928093059534386
## DPM1                                     2.80748150509152
## SCYL3                                    6.79828166108938
## C1orf112                                 5.54535603071796
## FGR                                     0.881688406557667
##          MUSCLE_AGE28_M_GROUP20_34_CC_T326_Gt001_R1_COUNT
## TSPAN6                                   4.58056938804357
## TNMD                                    0.276215239480014
## DPM1                                     7.38875765609038
## SCYL3                                    14.3977193578957
## C1orf112                                 10.7493764030972
## FGR                                    0.0920717464933381
##          MUSCLE_AGE31_F_GROUP20_34_AS_T640_Gt090_R1_COUNT
## TSPAN6                                   2.97104297440637
## TNMD                                    0.125284944703883
## DPM1                                     3.65116124565602
## SCYL3                                    10.2733654657184
## C1orf112                                 7.10544614963452
## FGR                                    0.0357956984868238
##          MUSCLE_AGE31_M_GROUP20_34_AA_T230_Gt025_R1_COUNT
## TSPAN6                                   3.70045713497259
## TNMD                                    0.126511355041798
## DPM1                                     7.87533185135192
## SCYL3                                    10.4055589521879
## C1orf112                                 7.93858752887282
## FGR                                     0.537673258927641
##          MUSCLE_AGE31_M_GROUP20_34_AA_T328_Gt005_R1_COUNT
## TSPAN6                                   2.02414123204382
## TNMD                                     4.46306550344088
## DPM1                                     5.01057911538716
## SCYL3                                    7.68178188882204
## C1orf112                                 7.13426827687576
## FGR                                     0.199095858889556
##          MUSCLE_AGE33_M_GROUP20_34_CC_T340_Gt042_R1_COUNT
## TSPAN6                                   3.04293369487199
## TNMD                                    0.092210111965818
## DPM1                                     2.45893631908848
## SCYL3                                     7.6841759971515
## C1orf112                                 6.48544454159586
## FGR                                     0.875996063675271
##          MUSCLE_AGE35_M_GROUP35_49_CC_T670_Gt104_R1_COUNT
## TSPAN6                                   0.69301546223257
## TNMD                                   0.0945021084862595
## DPM1                                     2.40980376639962
## SCYL3                                    4.48885015309733
## C1orf112                                 4.03208996208041
## FGR                                       1.7482890069958
##          MUSCLE_AGE37_F_GROUP35_49_AA_T120_Gt037_R1_COUNT
## TSPAN6                                   2.85926366961664
## TNMD                                     1.06305956947285
## DPM1                                     3.61073681355434
## SCYL3                                    6.68994384237226
## C1orf112                                 5.59022704636586
## FGR                                     0.348243652068693
##          MUSCLE_AGE37_M_GROUP35_49_AA_T460_Gt064_R1_COUNT
## TSPAN6                                   1.72294853327264
## TNMD                                   0.0220890837599056
## DPM1                                         4.6828857571
## SCYL3                                    7.06850680316981
## C1orf112                                 4.94795476221887
## FGR                                      0.50804892647783
##          MUSCLE_AGE38_M_GROUP35_49_CC_T232_Gt016_R1_COUNT
## TSPAN6                                   2.93231394109958
## TNMD                                   0.0778490426840596
## DPM1                                      6.7469170326185
## SCYL3                                    9.80897937819151
## C1orf112                                  9.4456838456659
## FGR                                     0.570892979683104
##          MUSCLE_AGE38_M_GROUP35_49_CC_T743_Gt093_R1_COUNT
## TSPAN6                                   1.65781066052541
## TNMD                                    0.191812803697155
## DPM1                                     3.43892955199899
## SCYL3                                    9.38512646661079
## C1orf112                                 5.11044112707419
## FGR                                     0.342522863744919
##          MUSCLE_AGE42_F_GROUP35_49_CC_T445_Gt059_R1_COUNT
## TSPAN6                                   1.21152253824014
## TNMD                                    0.193843606118422
## DPM1                                      4.3453275038213
## SCYL3                                    10.9683173795341
## C1orf112                                 8.14143145697374
## FGR                                     0.565377184512065
##          MUSCLE_AGE42_M_GROUP35_49_AA_T297_Gt024_R1_COUNT
## TSPAN6                                   1.93664808282571
## TNMD                                                    0
## DPM1                                     9.61043409522533
## SCYL3                                    11.7509398709801
## C1orf112                                 10.3530585480382
## FGR                                     0.203857692929022
##          MUSCLE_AGE45_M_GROUP35_49_AA_T292_Gt030_R1_COUNT
## TSPAN6                                   2.04923007325138
## TNMD                                    0.148494932844303
## DPM1                                     4.48454697189795
## SCYL3                                    8.90969597065817
## C1orf112                                 6.94956285711338
## FGR                                     0.237591892550885
##          MUSCLE_AGE45_M_GROUP35_49_AA_T325_Gt017_R1_COUNT
## TSPAN6                                   3.91721639905587
## TNMD                                    0.120529735355565
## DPM1                                     6.62913544455609
## SCYL3                                    13.6499925290178
## C1orf112                                 10.2751599390619
## FGR                                    0.0602648676777826
##          MUSCLE_AGE47_F_GROUP35_49_CC_T490_Gt078_R1_COUNT
## TSPAN6                                   2.08780115679835
## TNMD                                    0.205357490832625
## DPM1                                     3.21726735637779
## SCYL3                                    8.23141275754104
## C1orf112                                 7.37575654573843
## FGR                                     0.427828105901301
##          MUSCLE_AGE47_M_GROUP35_49_AA_T475_Gt060_R1_COUNT
## TSPAN6                                   2.26094084644405
## TNMD                                   0.0309717924170417
## DPM1                                      4.5993111739307
## SCYL3                                     7.4951737649241
## C1orf112                                 5.21874702227153
## FGR                                     0.588464055923793
##          MUSCLE_AGE51_M_GROUP50_64_CC_T289_Gt015_R1_COUNT
## TSPAN6                                   2.44554975221804
## TNMD                                    0.098610877105566
## DPM1                                     6.66609529233626
## SCYL3                                    12.1685822348269
## C1orf112                                 10.3344199206633
## FGR                                     0.256388280474472
##          MUSCLE_AGE52_F_GROUP50_64_AA_T535_Gt063_R1_COUNT
## TSPAN6                                    1.8385794802239
## TNMD                                     1.91851771849451
## DPM1                                     1.81193340080037
## SCYL3                                    8.95308268630771
## C1orf112                                 6.39505906164836
## FGR                                     0.719444144435441
##          MUSCLE_AGE52_F_GROUP50_64_CC_T331_Gt027_R1_COUNT
## TSPAN6                                    2.2635487463469
## TNMD                                                    0
## DPM1                                     6.17086503468381
## SCYL3                                    6.38644110576447
## C1orf112                                   5.578030839212
## FGR                                     0.673675222127054
##          MUSCLE_AGE52_M_GROUP50_64_AA_T505_Gt056_R1_COUNT
## TSPAN6                                   1.66829155323219
## TNMD                                    0.428048490631943
## DPM1                                     4.58780177138851
## SCYL3                                    10.5914562425596
## C1orf112                                 8.33048216383704
## FGR                                     0.856096981263885
##          MUSCLE_AGE52_M_GROUP50_64_CC_T329_Gt007_R1_COUNT
## TSPAN6                                   1.75026757200975
## TNMD                                    0.181062162621698
## DPM1                                     8.17797434508003
## SCYL3                                    9.68682570026085
## C1orf112                                  7.5744338030077
## FGR                                     0.150885135518082
##          MUSCLE_AGE54_M_GROUP50_64_AA_T520_Gt057_R1_COUNT
## TSPAN6                                  0.814397013907725
## TNMD                                   0.0364655379361668
## DPM1                                     4.18138168334713
## SCYL3                                    5.88310678703491
## C1orf112                                  5.2024167455598
## FGR                                     0.486207172482224
##          MUSCLE_AGE57_M_GROUP50_64_CC_T266_Gt004_R1_COUNT
## TSPAN6                                   1.90107736665636
## TNMD                                    0.159903703737451
## DPM1                                     7.01799588625478
## SCYL3                                    13.7161843650347
## C1orf112                                 11.2287934180077
## FGR                                     0.195437860123551
##          MUSCLE_AGE58_F_GROUP50_64_CC_T330_Gt011_R1_COUNT
## TSPAN6                                   3.74986148520786
## TNMD                                     0.27304816639863
## DPM1                                     3.60423579646192
## SCYL3                                    10.6306752784533
## C1orf112                                 8.90137022459535
## FGR                                      1.89313395369717
##          MUSCLE_AGE60_M_GROUP50_64_CC_T300_Gt013_R1_COUNT
## TSPAN6                                   3.36262730901047
## TNMD                                   0.0894995861713813
## DPM1                                     4.20648055005492
## SCYL3                                    6.31611365266605
## C1orf112                                 4.74347806708321
## FGR                                     0.115070896506062
##          MUSCLE_AGE62_F_GROUP50_64_CC_T550_Gt070_R1_COUNT
## TSPAN6                                   1.34756987621238
## TNMD                                   0.0777444159353297
## DPM1                                     4.83311119064633
## SCYL3                                    12.5427657708999
## C1orf112                                 8.43526912898327
## FGR                                     0.544210911547308
##          MUSCLE_AGE63_F_GROUP50_64_CC_T676_Gt101_R1_COUNT
## TSPAN6                                   1.88087653285919
## TNMD                                   0.0847241681468106
## DPM1                                     3.18562872232008
## SCYL3                                    6.01541593842355
## C1orf112                                  4.3378774091167
## FGR                                      1.05057968502045
##          MUSCLE_AGE63_M_GROUP50_64_CC_T685_Gt092_R1_COUNT
## TSPAN6                                   1.61913463112672
## TNMD                                     1.48785344481915
## DPM1                                     3.44248444095411
## SCYL3                                     7.4100936271385
## C1orf112                                 6.52029891994274
## FGR                                     0.437603954358573
##          MUSCLE_AGE67_F_GROUP65_79_CC_T610_Gt080_R1_COUNT
## TSPAN6                                   1.62612528375301
## TNMD                                   0.0985530475001827
## DPM1                                     4.17207901084107
## SCYL3                                    9.00117833835002
## C1orf112                                 6.80016027751261
## FGR                                     0.574892777084399
##          MUSCLE_AGE67_M_GROUP65_79_AA_T295_Gt039_R1_COUNT
## TSPAN6                                    2.2023732043387
## TNMD                                    0.425792152838816
## DPM1                                     5.02141090589224
## SCYL3                                    6.37219980455331
## C1orf112                                 5.50593301084676
## FGR                                     0.102777416202473
##          MUSCLE_AGE67_M_GROUP65_79_CC_T228_Gt002_R1_COUNT
## TSPAN6                                   2.24302570890385
## TNMD                                     0.34932367597683
## DPM1                                      6.8945462363848
## SCYL3                                    12.6124232484266
## C1orf112                                 11.8402340699515
## FGR                                     0.845731004996535
##          MUSCLE_AGE69_F_GROUP65_79_CC_T715_Gt095_R1_COUNT
## TSPAN6                                    2.1753375088858
## TNMD                                    0.140994097798154
## DPM1                                     3.46442640304035
## SCYL3                                    6.46558648474392
## C1orf112                                 6.92885280608071
## FGR                                     0.825822572817759
##          MUSCLE_AGE70_M_GROUP65_79_AS_T625_Gt089_R1_COUNT
## TSPAN6                                   1.47235523978391
## TNMD                                    0.110426642983793
## DPM1                                     4.06738134990306
## SCYL3                                    9.55190461809813
## C1orf112                                 8.17157158080071
## FGR                                     0.717773179394657
##          MUSCLE_AGE70_M_GROUP65_79_CC_T595_Gt061_R1_COUNT
## TSPAN6                                   2.09849085464065
## TNMD                                     0.12466282304796
## DPM1                                       5.131952882141
## SCYL3                                    9.47437455164493
## C1orf112                                 8.51862624161057
## FGR                                     0.893416898510377
##          MUSCLE_AGE72_F_GROUP65_79_CC_T296_Gt018_R1_COUNT
## TSPAN6                                   2.84101672473088
## TNMD                                   0.0249211993397446
## DPM1                                     9.04639536032728
## SCYL3                                    10.3921401246735
## C1orf112                                 13.0088660553467
## FGR                                      0.27413319273719
##          MUSCLE_AGE72_F_GROUP65_79_CC_T580_Gt071_R1_COUNT
## TSPAN6                                   2.39477254491478
## TNMD                                    0.137404982085274
## DPM1                                     4.82880365613965
## SCYL3                                    9.18650451655835
## C1orf112                                 7.06654193581412
## FGR                                     0.569249211496137
##          MUSCLE_AGE72_M_GROUP65_79_CC_T298_Gt014_R1_COUNT
## TSPAN6                                   2.88336976944442
## TNMD                                     0.12872186470734
## DPM1                                     6.35886011654261
## SCYL3                                    8.23819934126978
## C1orf112                                 7.59459001773307
## FGR                                     0.772331188244042
##          MUSCLE_AGE72_M_GROUP65_79_CC_T730_Gt084_R1_COUNT
## TSPAN6                                   2.12175820816641
## TNMD                                   0.0328954760956033
## DPM1                                     3.50336820418175
## SCYL3                                    6.10211081573441
## C1orf112                                 4.63826212948006
## FGR                                      0.60856630776866
##          MUSCLE_AGE72_M_GROUP65_79_CC_T760_Gt109_R1_COUNT
## TSPAN6                                   1.48049087054646
## TNMD                                    0.251404110092795
## DPM1                                      3.5615582263146
## SCYL3                                    7.73765983285603
## C1orf112                                 5.57279110705696
## FGR                                     0.893881280329938
##          MUSCLE_AGE73_M_GROUP65_79_CC_T565_Gt065_R1_COUNT
## TSPAN6                                    1.1221350455523
## TNMD                                   0.0170020461447319
## DPM1                                     3.04336625990701
## SCYL3                                    8.77305581068165
## C1orf112                                 6.51178367343231
## FGR                                     0.714085938078739
##          MUSCLE_AGE76_M_GROUP65_79_CC_T745_Gt098_R1_COUNT
## TSPAN6                                  0.914941886947357
## TNMD                                    0.439875907186229
## DPM1                                     1.91785895533196
## SCYL3                                    7.49548545845335
## C1orf112                                 4.89142008791087
## FGR                                     0.897346850659908
##          MUSCLE_AGE80_F_GROUP80_PLUS_CC_T290_Gt032_R1_COUNT
## TSPAN6                                     1.98110467083993
## TNMD                                      0.918512165571242
## DPM1                                      0.720401698487249
## SCYL3                                      3.71006874720933
## C1orf112                                   3.94419929921769
## FGR                                       0.702391656025068
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T291_Gt029_R1_COUNT
## TSPAN6                                     2.25340931993861
## TNMD                                      0.145381246447652
## DPM1                                       8.11711959332727
## SCYL3                                      11.0732049377629
## C1orf112                                   10.5643705751961
## FGR                                       0.145381246447652
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T790_Gt110_R1_COUNT
## TSPAN6                                    0.927187455996697
## TNMD                                      0.185437491199339
## DPM1                                       2.57294519039084
## SCYL3                                      8.41422616317003
## C1orf112                                   6.07307783677837
## FGR                                        1.34442181119521
##          MUSCLE_AGE81_M_GROUP80_PLUS_CC_T805_Gt108_R1_COUNT
## TSPAN6                                     1.31350438923174
## TNMD                                      0.454674596272524
## DPM1                                       2.95538487577141
## SCYL3                                      7.93154573497626
## C1orf112                                   5.35505635609862
## FGR                                       0.732531293994623
##          MUSCLE_AGE83_M_GROUP80_PLUS_CC_T294_Gt026_R1_COUNT
## TSPAN6                                     1.95716825454737
## TNMD                                      0.250919006993253
## DPM1                                       5.77113716084482
## SCYL3                                      12.4623106806649
## C1orf112                                   10.6389658965139
## FGR                                      0.0669117351982008

1.2 —— Revisit Graphical Data Summary (MDS plot)

# First define the grouping of samples: 
grouping <-data.frame(lapply(colnames(norm_count),
                                      FUN=function(x){unlist(strsplit(x, split = "_"))[c(3,4)]}))
colnames(grouping) <- colnames(norm_count)
rownames(grouping) <- c("sex","age")
grouping <- data.frame(t(grouping))

filtered_RNA_matrix <- as.matrix(RNAfiltered[3:55])
filtered_RNA_matrix[1:3,1:3]
##        MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6                                               30
## TNMD                                                 35
## DPM1                                                 68
##        MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6                                               57
## TNMD                                                 14
## DPM1                                                302
##        MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6                                              138
## TNMD                                                  0
## DPM1                                                249
# Create the DEGList object and calculate normalization factors:
d = DGEList(counts=filtered_RNA_matrix, group=grouping$age)
d = calcNormFactors(d, method = "TMM") #trimmed mean method 

short <- sapply(strsplit(colnames(RNAfiltered)[2:54], split = "_"), 
                function(x) paste(x[4], collapse = '_'))

# No clear cluster by age group...
plotMDS(data.matrix(norm_count), labels=short, cex=0.6,
        col=c("blue", "green", "pink", "red", "black")[factor(grouping$age)])
title(main = "MDS plot -- Group by Age") 
legend("topleft", legend=c("Age 20~34", "Age 35~49", "Age 50~64", "Age 65~79", "Age > 80"), fill=c("blue", "green", "pink", "red", "black"), cex=0.8)

# No clustering by sex either...
plotMDS(data.matrix(norm_count), labels=short, cex=0.8,
        col=c("pink", "blue")[factor(grouping$sex)])
title(main = "MDS plot --- Group by Sex") 
legend("topleft", legend=c("Female", "Male"), fill=c("pink", "blue"), cex=0.8)

# hmm.., sex and age are the only covariates used in this experiment, so this RNA-seq data must be very dispersed

1.3 —— Model Designs

Part I : Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests

### GENERALIZED LINEAR MODEL using Age & Sex 
model_design <- model.matrix(~grouping$age+grouping$sex+0)
kable(model_design[1:5, 1:6], type = "html") %>% kableExtra::row_spec(0, angle=-20)
grouping\(ageGROUP20 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)ageGROUP35 grouping\(ageGROUP50 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)ageGROUP65 grouping\(ageGROUP80 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)sexM
1 0 0 0 0 1
1 0 0 0 0 1
1 0 0 0 0 0
1 0 0 0 0 1
1 0 0 0 0 0
# Estimate dispersion according to model design
if(!exists('est_d')){
  est_d <- estimateDisp(d, model_design)
}

# Fit the model to data
fit <- glmQLFit(est_d, model_design)

# Use the Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests from edgeR to calculate differential expression of bulk RNA-seq data
qlf_output <- glmQLFTest(fit)

# Benjamini Hochberg Multiple Testing Correction is applied by function topTags()
qlf_output_hits <- topTags(qlf_output, 
                           adjust.method = "BH", 
                           sort.by = "PValue", 
                           n = nrow(norm_count))
qlf_output_hits$table[1:5,]
##           logFC     logCPM        F       PValue          FDR
## ZFY    6.197065  1.9294452 276.6287 1.362372e-23 2.590671e-19
## TTTY14 8.983947  1.4749536 276.0184 1.435911e-23 2.590671e-19
## HSFY2  8.520869 -0.4014364 240.8904 3.527209e-22 4.242527e-18
## EIF1AY 9.526244  3.0596327 235.4057 6.015401e-22 5.426493e-18
## KDM5D  7.188870  3.8722766 216.3795 4.156852e-21 2.999917e-17
### False Discovery Rate (FDR) p-Value Correction:
# Note: Benjamini-Hochberg (aka "fdr") method is used here to control the false discovery rate, meaning the expected proportion of false discoveries among rejected hypotheses.
plot(qlf_output_hits$table$logFC,-log10(qlf_output_hits$table$FDR),xlab="Log Fold Change (logFC)", ylab="-log10(adj p-val)",
     main="FDR Adjusted p-Values from Linear Regressions")
legend("topleft",c("p=0.05","p=0.01"),col=c("red","blue"),
        lty=1:1,cex=0.8)
abline(1.3,0,col="red")
abline(2,0,col="blue")
abline(v=0,col="black")

GENERALIZED LINEAR MODEL OUTPUT REPORT

# Genes that passed the threshold p-value < 0.05: 2292
# Genes that passed the threshold p-value < 0.01: 899
# Genes up-regulated/over-represented with age, under p-value < 0.01: 578
# Genes down-regulated/under-represented with age, under p-value < 0.01: 321

# Genes that passed the BH multiple testing correction: 397

# 2292 genes passed the threshold p-value < 0.05
glm_total_0.05 <- length(which(qlf_output_hits$table$PValue < 0.05))
# 899 genes passed the threshold p-value < 0.01
glm_total_0.01 <- length(which(qlf_output_hits$table$PValue < 0.01))
# 397 genes passed BH multiple testing correction 
glm_corrected <- length(which(qlf_output_hits$table$FDR < 0.05))

# Rank output hits by +/- log fold change
qlf_output_ranked <- qlf_output_hits$table
qlf_output_ranked[, "rank"] <- 
  -log(qlf_output_ranked$PValue, base = 10) * sign(qlf_output_ranked$logFC)
qlf_output_ranked <- qlf_output_ranked[order(qlf_output_ranked$rank),]

# We will proceed with gene lists thresholded by P=0.01 to be more stringent with our selection: 
# Under p-value < 0.01, there are 578 up-regulated/over-represented genes 
glm_over <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01
                                              & qlf_output_ranked$logFC > 0)]
# Under p-value < 0.01, there are 321 down-regulated/under-represented genes 
glm_under <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01
                                              & qlf_output_ranked$logFC < 0)]

glm_total <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01)]
write.table(glm_total,
            file = file.path("/Users/apple/bcb420_code/DE_data","total_byGLM.txt"),sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

# Write analyzed gene list as text files
write.table(glm_over,
            file = file.path("/Users/apple/bcb420_code/DE_data","skeletalMuscle_upregulated_genes_byGLM.txt"), sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

write.table(glm_under,
            file = file.path("/Users/apple/bcb420_code/DE_data","skeletalMuscle_downregulated_genes_byGLM.txt"),sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)


### GENERALIZED LINEAR Model volcano plot
glin.info <- data.frame(PValue = qlf_output$table$PValue, logFC = qlf_output_ranked$logFC)
# head(lin.info)
glin.info$threshold <- factor(ifelse(glin.info$PValue<0.01 & glin.info$logFC < 0, 1, 
                                 ifelse(glin.info$PValue<0.01 & glin.info$logFC > 0,-1,0)))
options(scipen=999)  # turn-off scientific notation like 1e+48

glinear_Vplot <- ggplot(glin.info, aes(x=logFC, y=-log10(PValue))) + 
  geom_point(aes(color=glin.info$threshold),show.legend = T)+
  scale_color_manual(values = c("0" = "black", "-1" = "red", "1"="blue"), labels = c("Not significant", "Over-represented", "Under-represented"))+
  scale_shape(solid = F)+
  geom_hline(aes(yintercept= -log10(0.01), linetype = "p = 0.01"), colour= 'black',linetype="dashed") +
  geom_vline(xintercept = 0,color="black")+
  xlim(c(-0.07,0.07))+ylim(c(0,10))+
  theme(text = element_text(size=10))+
  labs(y="-log10(p-val)", 
       x="log Fold Change",
       title = "Negative Binomial Generalized Linear Model of Significant RNAs",
       color = "Legend")

plot(glinear_Vplot) 

### Heat Map of top hits using Generalized Linear Model (P < 0.05)
top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$PValue<0.05] 

heatmap_matrix <- data.matrix(norm_count[which(rownames(norm_count) %in% top_hits),])

# Age specifiers
Age=c(22,23,25,25,26,26,27,28,31,31,31,33,35,37,37,38,38,42,42,45,45,47,
      47,51,52,52,52,52,54,57,58,60,62,63,63,67,67,67,69,70,70,72,72,72,
      72,72,73,76,80,81,81,81,83)

GLM_HeatMap = heatmap.2(heatmap_matrix, #input CPM data for RNAs
                  xlab="Age (yr)", 
                  main="Generalized Linear Model", 
                  labCol = Age, #participants' ages labeled on x-axis
                  Colv = NA, #hide column dendrogram
                  Rowv = T,  #show row dendrogram
                  scale="row", #values centered and scaled in row direction
                  dendrogram = "row", #create row dendrogram
                  labRow = numeric(), #adds y-axis label
                  col=brewer.pal(4,"YlGnBu"), #specify color scheme and how many colors we'll use from it
                  trace = "none", #add lines row-wise and/or column-wise to show proportionality of measurement (e.g. if trace="row", row-wise lines with express how large z-scores are for every cell) in this case this option adds too much noise to the heat map
                  margins = c(5, 2),
                  key=TRUE, #add key to show z-scores (standard deviation)
                  key.title= "SD", #add key title
                  key.xlab="Row Z-Score", #add x-axis label to key
                  key.ylab = "CPM", #add y-axis label to key
                  cexCol = 1, #change size of axis labels
                  density.info = "none", #shows distribution of z-scores (RNA expression levels) of all cells in key (usually normal)
                  lhei = c(1,4), #adjust height and width of heat map
                  key.par = list(cex=0.7,mgp=c(1.5, 0.5, 0), mar=c(3, 2.5, 1, 0)), #adjust size of key
                  
                  distfun = function(heatmap_matrix) as.dist(1-cor(t(heatmap_matrix))), #distance function
                  
                  hclustfun = function(heatmap_matrix) hclust(heatmap_matrix, method = "complete", members = NULL) #clustering method
)
Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age.

Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age.

Part II: General Linear Model

  • While GLM uses likelihood for modeling, LM uses least squares.The latter has recently been advocated to more reliably maintain control of type I error rates in tests for no association, while seemingly losing little in power. Here, in the spirit of trying, I will try applying a ordinary linear regression model to fit the data.
  • Here in this model, β-value refers to the “power” of a test — which is one minus the probability of type II error. Similar to the function of logFC (log fold change) calculated by edgeR, β-values implemented help us differentiate between the up (β > 0) and down (β < 0) regulated genes in older compared to younger individuals.
#################################################################
### LINEAR MODEL with detailed age and sex
norm_count=as.data.frame(cpm(d,log=T))

# Set up P and Beta values
p.val.lin=numeric()
beta.lin=numeric()
count0=numeric()

# Age specifiers (previously defined)
# Sex specifiers: 1 for male, 0 for female
sex=as.factor(c(1,1,1,0,0,0,0,1,1,1,0,1,
                1,0,1,1,1,0,1,1,1,1,0,
                1,1,0,1,0,1,1,0,1,0,1,0,
                1,1,0,0,1,1,1,0,0,1,1,1,1,
                0,1,1,1,1))


# Fit the linear model to data 
for (i in 1:36084) {
  #print(i)
  model=lm(t(norm_count[i,])~Age+sex,data=norm_count)
  # coefficients are age & sex
  p.val.lin[i]=(summary(model)$coefficients[2,4])
  beta.lin[i]=summary(model)$coefficients[2,1]
  
}

head(model)
## $coefficients
##  (Intercept)          Age         sex1 
## -2.841607604  0.007816151 -0.553526570 
## 
## $residuals
##           1           2           3           4           5           6 
##  2.06631409  1.39653077 -1.61798828  0.81780961 -0.03470712  0.27609179 
##           7           8           9          10          11          12 
## -0.98006789 -1.64143673 -1.66488518  1.00227663  1.77915143  2.12709346 
##          13          14          15          16          17          18 
##  2.14438168  2.30416091 -1.01316059 -0.92730147 -1.71959824 -2.30438942 
##          19          20          21          22          23          24 
## -1.75086285 -0.35470041 -0.34157551  0.97377896 -2.34347017 -0.40512175 
##          25          26          27          28          29          30 
##  2.58305699 -0.11958287  0.81761098  0.01120625 -1.84465666  0.59383784 
##          31          32          33          34          35          36 
## -0.59480849  0.94231775 -2.46071245 -0.63189359  0.53027584 -1.94626663 
##          37          38          39          40          41          42 
##  2.34896475 -0.28892781 -0.20035314 -0.34858546 -0.85075213 -0.34914146 
##          43          44          45          46          47          48 
## -0.85435617  2.18421353  0.50998157  0.31884955 -1.99316354  0.14478867 
##          49          50          51          52          53 
##  2.27846616  0.63733604  2.52202984  0.34326597 -2.07132505 
## 
## $effects
##   (Intercept)           Age          sex1                             
## 20.3615195544  0.9056985323  1.9024372525  0.0136142385 -0.8316888542 
##                                                                       
## -0.5208899443 -1.7698359862 -1.8780780129 -1.8798855521  0.7872762662 
##                                                                       
##  1.0182378904  1.9265203696  1.9582358671  1.5865291934 -1.1848791312 
##                                                                       
## -1.0918063702 -1.8841031434 -2.9859529392 -1.8865131956 -0.4687098456 
##                                                                       
## -0.4555849411  0.8741968022 -2.9889655044 -0.4758493513  2.5195430293 
##                                                                       
## -0.7290100113  0.7540970143 -0.5982208920 -1.8937433522  0.5663920668 
##                                                                       
## -1.1609537960  0.9365128930 -2.9980032002 -0.6160575329  0.0001987281 
##                                                                       
## -1.9015760218  2.3936553591 -0.7901503730 -0.6871484238 -0.2822539382 
##                                                                       
## -0.7844206003 -0.2683826570 -1.3195105420  1.7190591608  0.5907403674 
##                                                                       
##  0.3996083541 -1.9051911001  0.2544020283  1.8710208981  0.7830175834 
##                                           
##  2.6677113827  0.4889475180 -1.9112162306 
## 
## $rank
## [1] 3
## 
## $fitted.values
##         1         2         3         4         5         6         7         8 
## -3.223179 -3.215363 -3.199730 -2.646204 -2.638388 -2.638388 -2.630572 -3.176282 
##         9        10        11        12        13        14        15        16 
## -3.152833 -3.152833 -2.599307 -3.137201 -3.121569 -2.552410 -3.105937 -3.098120 
##        17        18        19        20        21        22        23        24 
## -3.098120 -2.513329 -3.066856 -3.043407 -3.043407 -3.027775 -2.474248 -2.996510 
##        25        26        27        28        29        30        31        32 
## -2.988694 -2.435168 -2.988694 -2.435168 -2.973062 -2.949614 -2.388271 -2.926165 
##        33        34        35        36        37        38        39        40 
## -2.357006 -2.902717 -2.349190 -2.871452 -2.871452 -2.317925 -2.302293 -2.848004 
##        41        42        43        44        45        46        47        48 
## -2.848004 -2.832371 -2.278845 -2.278845 -2.832371 -2.832371 -2.824555 -2.801107 
##        49        50        51        52        53 
## -2.216315 -2.762026 -2.762026 -2.762026 -2.746394 
## 
## $assign
## [1] 0 1 2

LINEAR MODEL OUTPUT REPORT

# Genes that passed the threshold p-value < 0.05: 1381
# Genes that passed the threshold p-value < 0.01: 297
# Genes up-regulated/over-represented with age, under p-value < 0.01: 240
# Genes down-regulated/under-represented with age, under p-value < 0.01: 57

# Genes that passed the BH multiple testing correction: 0 :(

### LINEAR  MODEL OUPUT REPORT:
# Note: here we selected a conservative p-value threshold of P < 0.01
# A reminder about error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. 

### 297 genes have been identified as significantly expressed: of which 57 are under-represented (P < 0.01 ,  negative β values for age) and 240 are over-represented (P < 0.01, positive β values for age)
neg_betas= which(beta.lin < 0)
pos_betas= which(beta.lin > 0)
# 1318 genes that passed the threshold p-value < 0.05
lm_total_0.05 <- rownames(norm_count)[which(p.val.lin<0.05)]


# 297 genes that passed the threshold p-value < 0.01
lm_total <- rownames(norm_count)[which(p.val.lin<0.01)]
# Under p-value < 0.01, 240 genes are over-expressed
lm_over <- rownames(norm_count)[which(p.val.lin[pos_betas]<0.01)]
# Under p-value < 0.01, 57 genes are under-expressed
lm_under <- rownames(norm_count)[which(p.val.lin[neg_betas]<0.01)]


### False Discovery Rate (FDR) p-Value Correction:
adjp=p.adjust(p.val.lin,method="BH",n=length(p.val.lin))
# Same as for Model I, Benjamini-Hochberg (aka "fdr") method is used here to control the false discovery rate.
plot(beta.lin,-log10(adjp),xlab="Beta Value", ylab="-log10(adj p-val)",
     main="FDR Adjusted p-Values from Linear Regressions")
legend("topleft",c("p=0.05","p=0.01"),col=c("red","blue"),
        lty=1:1,cex=0.8)
abline(1.3,0,col="red")
abline(2,0,col="blue")
abline(v=0,col="black")

Yikes no genes past the multiple testing correction…
Which means this linear model does NOT fit our data as well as the previous model.
However …we can still try to visualize it:

### LINEAR Model volcano plot
lin.info <- data.frame(p.val.lin,beta.lin)
# head(lin.info)
lin.info$threshold <- factor(ifelse(lin.info$p.val.lin<0.01 & lin.info$beta.lin < 0, 1, 
                                 ifelse(lin.info$p.val.lin<0.01 & lin.info$beta.lin > 0,-1,0)))
options(scipen=999)  # turn-off scientific notation like 1e+48

linear_Vplot <- ggplot(lin.info, aes(x=beta.lin, y=-log10(p.val.lin))) + 
  geom_point(aes(color=lin.info$threshold),show.legend = T)+
  scale_color_manual(values = c("0" = "black", "-1" = "red", "1"="blue"), labels = c("Not significant", "Over-represented", "Under-represented"))+
  scale_shape(solid = F)+
  geom_hline(aes(yintercept= -log10(0.01), linetype = "p = 0.01"), colour= 'black',linetype="dashed") +
  geom_vline(xintercept = 0,color="black")+
  xlim(c(-0.07,0.07))+ylim(c(0,10))+
  theme(text = element_text(size=10))+
  labs(y="-log10(p-val)", 
       x="Beta Coefficient",
       title = "Linear Model of Significant RNAs",
       color = "Legend")

plot(linear_Vplot) 

# Volcano plot of the linear model output is surpisingly nice and clean? 
# Alternative graphic summaries of data from the edgeR package: MA plot
# plotSmear(est_d)
#Assign/name CPM data
norm_count_heatmap = data.matrix(norm_count[which(p.val.lin<0.01), ])

# Ah the struggle with graph margins returns...
LM_HeatMap = heatmap.2(norm_count_heatmap, #input CPM data for RNAs
                  xlab="Age (yr)", 
                  main="Linear Model", 
                  labCol = Age, #participants' ages labeled on x-axis
                  Colv = NA, #hide column dendrogram
                  Rowv = T,  #show row dendrogram
                  scale="row", #values centered and scaled in row direction
                  dendrogram = "row", #create row dendrogram
                  labRow = numeric(), #adds y-axis label
                  col=brewer.pal(4,"YlGnBu"), #specify color scheme and how many colors we'll use from it
                  trace = "none", #add lines row-wise and/or column-wise to show proportionality of measurement (e.g. if trace="row", row-wise lines with express how large z-scores are for every cell) in this case this option adds too much noise to the heat map
                  margins = c(5, 2),
                  key=TRUE, #add key to show z-scores (standard deviation)
                  key.title= "SD", #add key title
                  key.xlab="Row Z-Score", #add x-axis label to key
                  key.ylab = "CPM", #add y-axis label to key
                  cexCol = 1, #change size of axis labels
                  density.info = "none", #shows distribution of z-scores (RNA expression levels) of all cells in key (usually normal)
                  lhei = c(1,4), #adjust height and width of heat map
                  key.par = list(cex=0.7,mgp=c(1.5, 0.5, 0), mar=c(3, 2.5, 1, 0)), #adjust size of key
                  
                  distfun = function(norm_count_heatmap) as.dist(1-cor(t(norm_count_heatmap))), #distance function
                  
                  hclustfun = function(norm_count_heatmap) hclust(norm_count_heatmap, method = "complete", members = NULL) #clustering method
)
Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age. Additionally, a correlation-based distance method was used to cluster the RNAs obtained from the linear model with positive and negative β-values for age. RNAs with positive β-values are shown in the top of the heat map, containing RNAs with low to high expression (yellow to blue rows) with increasing age. Likewise, RNAs with negative β-values are shown on the bottom of the heat map, containing RNAs with high to low expression (blue to yellow rows) with increasing age.

Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age. Additionally, a correlation-based distance method was used to cluster the RNAs obtained from the linear model with positive and negative β-values for age. RNAs with positive β-values are shown in the top of the heat map, containing RNAs with low to high expression (yellow to blue rows) with increasing age. Likewise, RNAs with negative β-values are shown on the bottom of the heat map, containing RNAs with high to low expression (blue to yellow rows) with increasing age.

Part III: MODEL COMPARISON

tippytop.p.val.glm=which(qlf_output_hits$table$PValue<0.01) #899
top.p.val.glm=which(qlf_output_hits$table$PValue<0.05) #2292
glm_corrected <- which(qlf_output_hits$table$FDR < 0.05) #397 :)

tippytop.p.val.lin=which(p.val.lin<0.01) #297
top.p.val.lin=which(p.val.lin<0.05) #1318
lm_corrected <- which(adjp<0.05) # sadly none..

# Side Note: there's 73 genes overlapping from the two models at PValue < 0.05, these genes may be worth looking at
model_intersect_genes = norm_count[intersect(which(p.val.lin<0.05),which(qlf_output_hits$table$PValue<0.05)),] # their gene names and p_values are stored here
write.table(rownames(model_intersect_genes),
            file = file.path("/Users/apple/bcb420_code/DE_data","model_intersect_genes.txt"), sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Model I: Genewise Negative Binomial Generalized Linear Model
Number of genes that fall below threshold p-value of 0.05: 2292
Number of genes that fall below threshold p-value of 0.01: 899
Number of genes that passed FDR correction: 397

Model II: Linear Model
Number of genes that fall below threshold p-value of 0.05: 1318
Number of genes that fall below threshold p-value of 0.01: 297
Number of genes that passed FDR correction: 0

  • In conclusion, it makes sense to move forward with ORA using Model I, since it has not only identified more significant genes, but also has better fit. Gene list produced by Model II, when placed in g:Profiler will not pass FDR, meaning it will not yield any useful pathway information.

Part 2: Thresholded Over-Representation

This section will use the gene list produced by the previous differential expression analysis to perform over-representation analysis (ORA). This will take into account the list of either up-regulated or down-regulated (or both) gene IDs, and test whether a gene set contains disproportionately many genes of significant expression change.

*Note: the difference between ORA and ‘gene set enrichment analysis’ (GSEA) is their inclusion level. ORA reduces the full gene expression matrix to those passing a pre-determined threshold prior to analysis while GSEA first compute “differential expression scores” for ALL genes measured before evaluating their association with gene sets / biological pathways.

2.1 Method Selection

Enrichment Method

  • ORA is performed using g:Profiler[2], which employs the hypergeometric test (Fisher’s Exact) test to measure the significance of functional gene set terms in the input gene list. This method seeks to identify gene sets that share an unusually large number of genes with a list derived from experimental measurements, which in turn provided evidence that our gene list is likely associated with such gene sets.

Annotation Databases

  • Our choice of annotation databases include:  
  1. KEGG[3],a manually curated source of high-level biochemical functions and pathways. g:Profiler claims to update external databases like KEGG, Reactome, and Wikipathways during the same update routine straight from the data source. The latest version of KEGG is released on Jan 1, 2022.

  2. Reactome[4], another biological pathway database that covers the same number of genes as KEGG. However, KEGG covers more broad term pathways while Reactome contains more detailed entries. The latest version of Reactome is released on Dec 13, 2021.

  3. WikiPathways[5], an alternative pathway database. The latest version of WikiPathway is released on Feb 10, 2022.

Note: we are using all three pathway databases available on g:Profiler to better cross-examine the output gene set terms identified by them. For example, semantically analogous pathways (e.g. “Notch signaling pathway” in KEGG and “Signaling by NOTCH” pathway in Reactome) across multiple databases are likely better descriptors of our gene lists than pathways found by only one source.

  1. GO:biological process. Gene Ontology (GO) annotation is a most commonly used gene set annotation consortium. GO often describe biological processes by their outcomes or ending states. Compared to pathway databases like KEGG, which labels genes using broad biological processes, GO gives more flexibility and capture the complexity of functional relationships well. (e.g. for the same input gene list, KEGG may give “DNA replication” while GO may break down the process into “origin-licensing,” “DNA polymerization”…etc)

Threshold Correction

  • The Benjamini-Hochberg False Discovery Rate (FDR) approach is applied in g:Profiler to correct for false significant matches (Type I errors) within the results.

2.2 ORA Result Presentation & Interpreation

Input: Up-Regulated Genes

  • Length of input gene list = 528

  • Term size limit = 20 ~ 1000 genes (given that out RNA-seq data is not disease-oriented but extracted from healthy participants, we are more interested in broader pathways that are highlighted in muscle aging)

  • Under the BH-FDR corrected significance threshold of 0.05, and a term size limit of 20~500:
    #Gene sets returned by GO:BP = 66
    #Gene sets returned by KEGG = 14
    #Gene sets returned by REAC = 46
    #Gene sets returned by WP = 6

  • Output across all databases: up-regulated-overview

  • Notable Pathways (edited results, not entire output image): up-regulated-pathways Interpretation:
    For genes that are up-regulated with age, the top term identified across multiple databases is “cytoplasmic ribosomal protein translation”. This makes sense as aging is closely associated with the proteome, the gradual accumulation of mutations that arise in the aging process will lead to impaired protein synthesis. Another pathway tightly linked to aging is “oxidative phosphorylation”. During the aging process, the efficiency of oxidative phosphorylation tend to decline as a result of mitochondrial dysfunctions. Oxidative phosphorylation is also identified as a age-related pathway by authors of the study. The KEGG database is used by the original publication, from which they identified age-related sarcopenia (muscle wasting) and several other age-related neuromuscular diseases. However, according our results, the terms identified by KEGG does not seem to correlate perfectly with the authors’ findings. Several clearly non-related diseases were identified (e.g. COVID-19). This may be due to recent updates in the database that included novel gene-process associations for prominent diseases. In addition, the authors had mentioned that the top positively correlated protein–mRNA pairs were found involved in glycolysis/gluconeogenesis using KEGG databse. This is not reflected in our analysis.

Input: Down-Regulated Genes

  • Length of input gene list = 321

  • Term size limit = 20 ~ 2000 genes (set to larger terms because we want to highlight some of the broader, more generalized pathways associated with muscle aging e.g. tissue development)

  • Under the BH-FDR corrected significance threshold of 0.05:
    #Gene sets returned by GO:BP = 266
    #Gene sets returned by KEGG = 7
    #Gene sets returned by REAC = 16
    #Gene sets returned by WP = 9

  • Output across all databases: down-regulated-overview

  • Notable Pathways (edited results, not entire output image): down-regulated-pathways Interpretation:
    For genes that are down-regulated with age, large biological processes such as “tissue development” and “regulation of anatomical structure morphogenesis” are the top results returned by GO:BP. This makes sense as an organism’s ability to generate new tissues, as well as the maintenance of structural orientation and signalling naturally deteriorate with age. The same reasoning explains the decline “Collagen formation” and similar biosynthesis pathways identified by the REACTOME database. “Wnt signalling” is another important age-related pathway, it has long be established in literature that the impaired regenerative potential of skeletal muscle with aging is associated with the canonical Wnt signaling regulators, the lack of which will lead to an increase in tissue fibrosis. This finding was discussed in a 2007 paper published in Science[6]. Many pathways (ECM organization; AGE-RAGE signaling.. etc) identified by this analysis are also reflected in the original study.

Input: Total Genes

  • Length of input gene list = 899

  • Term size limit = 20 ~ 1000 genes

  • Under the BH-FDR corrected significance threshold of 0.05, and a term size limit of 20~500:
    #Gene sets returned by GO:BP = 58
    #Gene sets returned by KEGG = 15
    #Gene sets returned by REAC = 52
    #Gene sets returned by WP = 5

  • Output across all databases: up&down

  • Notable Pathways (edited results, not entire output image): pathways Interpretation:
    Compared to ORA performed using lists of up and down-regulated gene separately, the combined gene list returned a combination of terms from both analysis, which is expected. From this result, we can see that the strength/significance of top pathways associated with up-regulated genes seem to slightly trump the top terms associated with down-regulated genes. However, these results are difficult to interpret as they do not reveal the direction of association.

Table of Notable Genes: here I have highlighted three over-expressed genes with aging and three under-expressed genes with aging. These RNAs are amongt the top 20 ranked in their respective gene lists. The function and characteristics of these genes align with our expectation of muscle aging. For example, angiotensin-converting enzyme 2 (ACE2), an emzyme that protects agains aging-associated muscle wasting, is down-regulated with age. This means that as individuals age, they tend to be more prone to muscle wasting diseases.

Expression with age Common Name (RNA Ensembl ID) Padj Protein Coding Associated Functions & Characteristics
up XIST(ENSG00000229807) 4.35E-08 no Age-related X chromosome inactivation, can clinically result in late-onset X-linked disoders (e.g. muscular dystrophy)
up C12orf75(ENSG00000235162) 3.05E-06 yes Energy metabolism, insulin signalling, aging, skeletal muscle weakness
up FAM83B(ENSG00000168143) 1.99E-05 yes Hypoxia response pathway, aging, musclar dystrophy
down ACE2(ENSG00000130234) 4.35E-04 yes Protection against age-associated muscle wasting (e.g. DND)
down NPNT(ENSG00000168743) 5.18E-03 yes Muscle cell niche maintenance, tissue injury repair and regeneration
down CNBP(ENSG00000168714) 9.98E-03 yes Juvenile tissue development, abudant in the heart and skeletal muscles

Assignment Journal can be found here

References

1. Tumasian RA, Harish A, Kundu G, Yang J-H, Ubaida-Mohien C, Gonzalez-Freire M, et al. Skeletal muscle transcriptome in healthy aging. Nature Communications. 2021;12:1–6.
2. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. G: Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic acids research. 2019;47:W191–8.
3. Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: Integrating viruses and cellular organisms. Nucleic acids research. 2021;49:D545–51.
4. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase 2022. Nucleic acids research. 2022;50:D687–92.
5. Slenter DN, Kutmon M, Hanspers K, Riutta A, Windsor J, Nunes N, et al. WikiPathways: A multifaceted pathway database bridging metabolomics to other omics research. Nucleic acids research. 2018;46:D661–7.
6. Brack AS, Conboy MJ, Roy S, Lee M, Kuo CJ, Keller C, et al. Increased wnt signaling during aging alters muscle stem cell fate and increases fibrosis. Science. 2007;317:807–10.